feat: make mesh accept meshcontext #2266

Merged
akoumpa merged 45 commits into main from akoumpa/refactor_auto_class_public_api
Jun 9, 2026

Conversation

@adil-a

adil-a commented May 18, 2026

Collaborator

What does this PR do?

Refactors the distributed public API so topology and distributed policies are layered explicitly.

The main user-facing object is now DistributedSetup, which owns:

  • mesh_context: runtime topology and DeviceMesh / MoE mesh access
  • strategy_config: FSDP2 / Megatron FSDP / DDP strategy config
  • pipeline_config: pipeline-parallel runtime config
  • moe_parallel_config: MoE parallelization config
  • activation_checkpointing: activation-checkpointing policy

MeshContext is narrowed to topology only. It no longer owns activation checkpointing or higher-level training policy.

Changelog

  • Add DistributedSetup.build(...) as the component-layer entry point for constructing distributed setup from strategy, parallelism sizes, pipeline config, MoE config, and activation checkpointing.
  • Keep device_mesh compatibility in NeMoAutoModel*.from_pretrained by wrapping raw HF-style meshes into an internal topology-only DistributedSetup.
  • Remove legacy device_mesh.py and move raw mesh construction/access helpers into mesh_utils.py.
  • Introduce ParallelismSizes for dp/tp/pp/cp/ep sizing intent.
  • Move MoEParallelizerConfig into distributed config, since it is part of distributed setup rather than model-only MoE config.
  • Update recipes to build a single DistributedSetup from YAML/programmatic config and fan out the derived runtime attributes consistently.
  • Update diffusion, LLM, VLM, KD, retrieval, and sequence-classification callsites to use the new setup layering.
  • Update tests for the new layering and raw device_mesh compatibility.
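
The device_mesh compatibility path described in the changelog can be pictured roughly as follows. This is a minimal sketch with stand-in dataclasses, not the library's code: the helper name `resolve_distributed_setup`, the field defaults, and the choice to reject both arguments being passed together are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class MeshContext:
    """Stand-in for the topology-only object (fields are assumed)."""
    device_mesh: Any
    moe_mesh: Any = None


@dataclass(frozen=True)
class DistributedSetup:
    """Stand-in bundle: topology plus optional distributed policies."""
    mesh_context: MeshContext
    strategy_config: Any = None
    pipeline_config: Any = None
    moe_parallel_config: Any = None
    activation_checkpointing: Any = None


def resolve_distributed_setup(
    distributed_setup: Optional[DistributedSetup] = None,
    device_mesh: Any = None,
) -> Optional[DistributedSetup]:
    """Hypothetical helper: normalize the two entry points into one object."""
    if distributed_setup is not None and device_mesh is not None:
        # Assumed behavior: refuse ambiguous input rather than pick one silently.
        raise ValueError("pass either distributed_setup or device_mesh, not both")
    if distributed_setup is not None:
        return distributed_setup
    if device_mesh is not None:
        # Raw HF-style mesh: wrap it with no strategy or policy attached.
        return DistributedSetup(mesh_context=MeshContext(device_mesh=device_mesh))
    return None
```

With this shape, downstream code only ever sees a `DistributedSetup` (or `None`), so the raw-mesh path does not need its own branches further down.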

API shape

Python usage:

from nemo_automodel.components.distributed import DistributedSetup, FSDP2Config, ParallelismSizes
from nemo_automodel import NeMoAutoModelForCausalLM

distributed_setup = DistributedSetup.build(
    strategy=FSDP2Config(sequence_parallel=True),
    parallelism_sizes=ParallelismSizes(tp_size=2, ep_size=8),
)

model = NeMoAutoModelForCausalLM.from_pretrained(
    "model/name",
    distributed_setup=distributed_setup,
)

HF-compatible raw mesh usage is still allowed:

model = NeMoAutoModelForCausalLM.from_pretrained(
    "model/name",
    device_mesh=device_mesh,
)

Future work

FSDP2Config is currently not FSDP-only: it also carries TP/SP options. A follow-up PR will split those out to separate concerns.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

Validation:

  • python -m ruff check ...
  • python -m ruff format --check ...
  • python -m py_compile ...
  • pytest tests/unit_tests/recipes/test_dist_utils.py -q

Note: local full recipe test collection is blocked in my environment by an existing mlflow / cachetools.func.cached import mismatch. CI should be used for full CPU coverage.

Additional Information

This keeps the TorchTitan-like layering:

  • sizes: ParallelismSizes
  • topology: MeshContext
  • distributed policies and topology bundle: DistributedSetup
  • recipe/YAML adapter: create_distributed_setup_from_config
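
A rough sketch of how the first three layers could fit together is shown here. The class bodies are stand-ins, `build_setup` is a hypothetical function (not `DistributedSetup.build`), and the sizing rule is a simplifying assumption: dp is derived as world size divided by tp × pp × cp, and ep is recorded without being multiplied into the world size.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ParallelismSizes:
    """Sizing intent only; dp_size=None means "derive it from the world size"."""
    dp_size: Optional[int] = None
    tp_size: int = 1
    pp_size: int = 1
    cp_size: int = 1
    ep_size: int = 1


@dataclass(frozen=True)
class MeshContext:
    """Topology only: resolved axis sizes, no strategy or policy fields."""
    axis_sizes: Dict[str, int]


@dataclass(frozen=True)
class DistributedSetup:
    """Bundle of topology and distributed policies."""
    mesh_context: MeshContext
    strategy_config: Any = None
    activation_checkpointing: Any = None


def build_setup(
    sizes: ParallelismSizes,
    world_size: int,
    strategy_config: Any = None,
    activation_checkpointing: Any = None,
) -> DistributedSetup:
    """Hypothetical builder: sizes -> topology -> setup."""
    model_parallel = sizes.tp_size * sizes.pp_size * sizes.cp_size
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp * cp")
    dp_size = sizes.dp_size if sizes.dp_size is not None else world_size // model_parallel
    if dp_size * model_parallel != world_size:
        raise ValueError("dp * tp * pp * cp must equal world_size")
    topology = MeshContext(
        axis_sizes={
            "pp": sizes.pp_size,
            "dp": dp_size,
            "cp": sizes.cp_size,
            "tp": sizes.tp_size,
            "ep": sizes.ep_size,  # recorded only; see the assumption above
        }
    )
    return DistributedSetup(
        mesh_context=topology,
        strategy_config=strategy_config,
        activation_checkpointing=activation_checkpointing,
    )
```

The point of the split is that `MeshContext` can be constructed and inspected without knowing which strategy or checkpointing policy will be applied on top of it.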

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 18, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@adil-a

adil-a commented May 18, 2026

Collaborator Author

/ok to test 3dcadfb

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa

akoumpa commented May 18, 2026

Contributor

/ok to test a8b2df6

@akoumpa

akoumpa commented Jun 8, 2026

Contributor

/ok to test 4a4ba1a

…ntrol (#2444)

* feat(speculative): add reasoning mode control for EAGLE/P-EAGLE/DFlash training

Add --reasoning {none,save,disable} flag to regenerate.py for controlling
whether target model reasoning content is preserved or suppressed during
data regeneration. Add mask_reasoning_content option to EAGLE/P-EAGLE/DFlash
training recipes to exclude reasoning traces from the loss mask.

Co-authored-by: khazic <khazzz1c@gmail.com>
Signed-off-by: thyways <2484113689@qq.com>
Signed-off-by: khazic <khazzz1c@gmail.com>

* feat(speculative): add EAGLE-3 sequence packing for draft training

Pack variable-length chat samples into fixed-width rows for EAGLE-3
training, removing the per-sample padding waste of the default
max_length path. Documents within a row attend block-causally: the
target uses a 4D block-causal mask (SDPA) and the draft uses varlen
FlashAttention-2; cross-document TTT supervision is gated by
doc_remaining so deeper steps never leak across boundaries. Opt-in via
packed_sequence_size > 0, colocated target backend only. Covered by
unit tests plus an FA2-vs-eager parity test.
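
To make "block-causal" concrete, the mask for one packed row can be sketched in plain Python as below. This is an illustration only: the function name and the list-of-lists representation are assumptions, the commit builds a 4D tensor mask for SDPA instead, and padding positions are simply left fully masked here.

```python
from typing import List


def block_causal_mask(doc_lengths: List[int], row_width: int) -> List[List[bool]]:
    """Return a row_width x row_width mask; True means query i may attend key j."""
    if sum(doc_lengths) > row_width:
        raise ValueError("documents do not fit in the packed row")

    # Tag every position with the index of the document it belongs to (-1 = padding).
    doc_ids: List[int] = []
    for doc_id, length in enumerate(doc_lengths):
        doc_ids.extend([doc_id] * length)
    doc_ids.extend([-1] * (row_width - len(doc_ids)))

    mask = [[False] * row_width for _ in range(row_width)]
    for i in range(row_width):
        for j in range(i + 1):  # causal: keys at or before the query position
            # Block condition: same document, and neither side is padding.
            if doc_ids[i] != -1 and doc_ids[i] == doc_ids[j]:
                mask[i][j] = True
    return mask
```

For two documents of lengths 2 and 1 packed into a row of width 4, the third position attends only to itself: it starts a new document, so nothing from the first document is visible to it.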

Co-authored-by: khazic <khazzz1c@gmail.com>
Signed-off-by: thyways <2484113689@qq.com>
Signed-off-by: khazic <khazzz1c@gmail.com>

---------

Signed-off-by: thyways <2484113689@qq.com>
Signed-off-by: khazic <khazzz1c@gmail.com>
Co-authored-by: thyways <2484113689@qq.com>
Co-authored-by: Huiying <willwin.lee@gmail.com>

@jgerh jgerh left a comment

Contributor

Completed tech pubs review of docs/guides/gradient-checkpointing.md and provided a few suggestions

Comment thread docs/guides/gradient-checkpointing.md Outdated
Comment thread docs/guides/gradient-checkpointing.md
yuhezhang-ai and others added 9 commits June 8, 2026 15:10
…2389)

* feat(distributed): add selective activation checkpointing for FSDP2

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* fix(distributed): support selective activation checkpointing with torch.compile

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* docs(fern): drop selective AC from frozen v0.4 snapshot

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* feat(distributed): honor selective activation checkpointing on single GPU

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* feat(moe): support selective activation checkpointing with expert parallelism

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* fix(model): make DeepSeek MLP dispatch wrapper-safe

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* fix(distributed): save expert grouped-GEMM in selective AC and add op trace

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* feat(moe): compile selective activation checkpointing wrappers outer

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* refactor(distributed): move selective AC into its own module

Extract the TorchTitan-style selective activation checkpointing core out of
the central parallelizer.py into a dedicated activation_checkpointing.py:
op-set construction, the save/recompute policy, block/sub-module wrappers,
KV-sharing detection, and the compile-outer wrapper flag. parallelizer.py
keeps only the thin apply_selective_activation_checkpointing entry point,
which still needs the heavy, transformers-aware _extract_model_layers, so the
dependency stays one-directional (parallelizer -> activation_checkpointing ->
parallelizer_utils) with no circular imports.

Move the opt-in NEMO_SELECTIVE_AC_TRACE diagnostic out of parallelizer.py into
parallelizer_utils.maybe_trace_selective_ac_decision so the hot policy is a
single call site instead of trace globals plus a helper.

Make the new module's cross-module interface public (drop the leading
underscore) and keep internal op-resolution/plumbing private. Update the moe
and fsdp2 consumers and the unit tests to import from the new module.

Also fix doc wording: clarify that torch.compile must be held fixed when
comparing full vs. selective, and refer to TorchTitan as a reference
implementation rather than "upstream".

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* refactor(distributed): move selective-AC trace into the AC module

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* test(distributed): patch activation_checkpointing.checkpoint_wrapper after AC module split

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* docs: apply tech-writer edits to gradient-checkpointing guide

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

---------

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
* ci: add nemo-run, split qwen-vl-utils from decord for arm

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: override in pytorch container

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update uv lock

Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

---------

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…2419)

* fix(transformers): unify loaded HF dtype via promote_types

Make _restore_loaded_model_dtype dtype-aware: instead of always restoring to
the checkpoint dtype, unify each floating tensor to promote_types(checkpoint,
requested). This honors an explicit fp32 request while preserving
intrinsically-fp32 checkpoint params (e.g. A_log) under a bf16 request, and is
a no-op for the bf16/auto path. Fixes FSDP2 uniform-dtype tripping on
HF mixed-dtype loads.
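
The promotion rule in this commit message can be illustrated without PyTorch. The sketch below uses a small stand-in for `torch.promote_types`, limited to bfloat16, float16, and float32 and written as strings; the function names and the parameter names in the usage note are hypothetical.

```python
from typing import Dict

_FLOAT_DTYPES = ("bfloat16", "float16", "float32")


def promote(a: str, b: str) -> str:
    """Stand-in for torch.promote_types, restricted to three float dtypes."""
    if a not in _FLOAT_DTYPES or b not in _FLOAT_DTYPES:
        raise ValueError(f"unsupported dtype pair: {a}, {b}")
    # Equal dtypes promote to themselves; any mixed pair among these three
    # widens to float32 (bfloat16 and float16 cannot represent each other).
    return a if a == b else "float32"


def unify_loaded_dtypes(checkpoint_dtypes: Dict[str, str], requested: str) -> Dict[str, str]:
    """Per tensor: target dtype = promote(checkpoint dtype, requested dtype)."""
    return {name: promote(ckpt, requested) for name, ckpt in checkpoint_dtypes.items()}
```

Given a checkpoint with `{"mlp.weight": "bfloat16", "mixer.A_log": "float32"}`, a bfloat16 request leaves both tensors unchanged (the fp32 `A_log` is preserved), while a float32 request lifts `mlp.weight` to float32, which matches the three cases listed in the message.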

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* feat(distributed): default pipeline dtype to FSDP activation dtype

When pipeline parallelism is enabled and pipeline.dtype is unset, derive it from
the FSDP mixed-precision activation dtype (mp_policy.output_dtype, falling back to
param_dtype) so pipeline stage shape inference matches the real activation dtype
(e.g. bf16 compute under fp32 master weights). An explicitly set pipeline.dtype is
honored but warned on mismatch, since it can corrupt inter-stage recv buffers.
No-ops for strategies without an mp_policy (e.g. MegatronFSDP) and for pp_size==1.
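
The defaulting rule reads naturally as a small resolver. The following is a sketch under stated assumptions, not the commit's code: `resolve_pipeline_dtype` is a hypothetical name, dtypes are plain strings, and the policy class is a stand-in carrying only the two fields mentioned above.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MixedPrecisionPolicy:
    """Stand-in with the two fields named in the commit message."""
    param_dtype: Optional[str] = None
    output_dtype: Optional[str] = None


def resolve_pipeline_dtype(
    pipeline_dtype: Optional[str],
    mp_policy: Optional[MixedPrecisionPolicy],
    pp_size: int,
) -> Optional[str]:
    """Pick the dtype used for pipeline stage shape inference."""
    # No-op without pipeline parallelism or without a mixed-precision policy.
    if pp_size <= 1 or mp_policy is None:
        return pipeline_dtype

    # Activation dtype: output_dtype if set, otherwise fall back to param_dtype.
    activation_dtype = mp_policy.output_dtype or mp_policy.param_dtype

    if pipeline_dtype is None:
        return activation_dtype

    # An explicit value is honored, but a mismatch is worth a warning.
    if activation_dtype is not None and pipeline_dtype != activation_dtype:
        warnings.warn(
            f"pipeline dtype {pipeline_dtype} differs from activation dtype "
            f"{activation_dtype}; inter-stage buffers may be mis-sized"
        )
    return pipeline_dtype
```

Keeping the explicit value on a mismatch, rather than overriding it, mirrors the "honored but warned" wording above.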

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
(cherry picked from commit 3f6b246)

* refactor(distributed): resolve FSDP compute dtype per-param, decoupled from storage

fully_shard_by_dtype now groups parameters by their required *compute* dtype
instead of their storage dtype, so fp32 master weights (uniform fp32 storage)
still compute the bulk in mp_policy.param_dtype (bf16) while intrinsically-fp32
params keep fp32 compute.

Per-parameter compute dtype is resolved by precedence: pinned fp32
(_keep_in_fp32_modules_strict) > HF-recorded checkpoint dtype (tagged onto each
tensor at load time in _restore_loaded_model_dtype) > mp_policy.param_dtype.
Qwen3.5's GatedDeltaNet fp32 holder is declared via patch_hf_model; the
NemotronH and Qwen3.5 strategies thread the declaration through.
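
The precedence chain can be sketched as a small lookup followed by grouping. Everything here is illustrative: the function names are hypothetical, dtypes are strings, and matching a pinned module by a dotted-name component is an assumption about how the pin is applied.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def resolve_compute_dtype(
    param_name: str,
    pinned_fp32_modules: Set[str],
    recorded_dtypes: Dict[str, str],
    mp_param_dtype: str,
) -> str:
    """Precedence: pinned fp32 > recorded checkpoint dtype > mp_policy.param_dtype."""
    # 1. A parameter living under a pinned module always computes in fp32.
    if any(part in pinned_fp32_modules for part in param_name.split(".")):
        return "float32"
    # 2. Otherwise use the dtype recorded for this tensor at load time, if any.
    recorded = recorded_dtypes.get(param_name)
    if recorded is not None:
        return recorded
    # 3. Fall back to the mixed-precision policy's parameter dtype.
    return mp_param_dtype


def group_by_compute_dtype(
    param_names: Iterable[str],
    pinned_fp32_modules: Set[str],
    recorded_dtypes: Dict[str, str],
    mp_param_dtype: str,
) -> Dict[str, List[str]]:
    """Group parameters by resolved compute dtype rather than storage dtype."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in param_names:
        dtype = resolve_compute_dtype(name, pinned_fp32_modules, recorded_dtypes, mp_param_dtype)
        groups[dtype].append(name)
    return dict(groups)
```

Because storage dtype never enters the decision, a model stored uniformly in fp32 (master weights) still ends up with its bulk parameters in the bf16 group.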

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
(cherry picked from commit 3dd6b97)

* docs(model-onboarding): document _keep_in_fp32_modules_strict contract

Add SKILL.md §2.6 explaining which params must compute in fp32 (SSM A_log/
dt_bias/D, MoE sigmoid-gate bias, attention-sink bias, scale), how to declare
them (class attribute vs patch_hf_model instance attribute), and why the pin is
the robust signal across all load paths. Broaden the MoE checklist item and
code comment accordingly.

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
(cherry picked from commit a11db38)

* test(distributed): add fp32 compute-dtype contract test

Assert the resident compute dtype of every trainable parameter across the model
archetypes that use fully_shard_by_dtype (dense, Qwen3.5-style hybrid), covering
the full precedence chain: pinned fp32 > HF-recorded dtype > mp_policy.param_dtype,
under fp32 master weights and ordinary loads.

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
(cherry picked from commit dc83926)

* feat(model): cast frozen modules to compute dtype to avoid mismatch

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
(cherry picked from commit d321f5e)

* refactor(gemma4): drop projector dtype hook now general frozen cast handles it

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
(cherry picked from commit 1bc67e2)

* feat(training): add dormant resolve_storage_dtype helper

Add resolve_storage_dtype() (and its unit tests) for defaulting model.torch_dtype
to fp32 for full-parameter torch.optim training. Not yet wired into recipes here;
the call sites are marked with breadcrumb comments and enabled in a follow-up PR,
keeping this PR limited to dtype bug fixes with no behavior/memory change.

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* fix(model): cast frozen-module buffers and unsharded params to compute dtype

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* docs(infra): correct frozen-tower FSDP comment to match sharding reality

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* docs(mixed-precision): clarify TE vs torch AdamW memory and precision trade-offs

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* docs(mixed-precision): apply tech writer edits

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* docs(mixed-precision): drop unresolvable FSDP anchor

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

---------

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
…2448)

Add examples/speculative/README.md covering the whole speculative-decoding
draft-training subsystem: supported methods (EAGLE-1/2/3/3.1, P-EAGLE,
DFlash), target-model registry coverage, compute backends (eager vs
flash_attention_2, flex_attention/sdpa, fused Triton soft cross-entropy,
d2t/t2d draft-vocab compression), target backends (co-located, remote,
offline cache), serving and benchmarking, inference-engine compatibility,
and a consolidated config reference.

Fold the standalone regenerate_with_target.md into the README's data
preparation section (full two-step flow, tuning table, pitfalls) and remove
the separate file so there is a single entry point.

Signed-off-by: khazic <khazzz1c@gmail.com>
)

* feat(diffusion): add Wan2.2 T2V-A14B two-stage finetuning support

Signed-off-by: linnan wang <linnanw@nvidia.com>

* fix the memory management for training large 14B wan model

* fix wan2.2 support

* all good for wan2.2

* update

Signed-off-by: linnan wang <linnanw@nvidia.com>

* docs(fern): add Wan2.2 T2V-A14B model coverage and release log entry

Signed-off-by: linnan wang <linnanw@nvidia.com>

* fix another round of code review

* fix(diffusion): sort wan.py imports to satisfy CI isort (I001)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(diffusion): load inference checkpoints to CPU to halve peak GPU memory

Avoids doubling peak GPU memory (and a potential OOM in Wan2.2 two-stage
inference) by loading EMA/consolidated state dicts with map_location="cpu";
load_state_dict copies into the already-on-device parameters.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Signed-off-by: linnan wang <linnanw@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Resolve conflicts between the MeshContext/DistributedSetup refactor and
main's selective activation checkpointing (#2389), FSDP2 dtype fixes (#2419),
and DDP find_unused_parameters:

- config.py: keep the DistributedSetup/MoEParallelizerConfig refactor and the
  DistributedStrategyConfig rename; fold in ActivationCheckpointingMode + a
  back-compat DistributedConfig alias; widen DistributedSetup.activation_checkpointing;
  DDPConfig gains find_unused_parameters and drops backend.
- mesh.py: MeshContext stays pure topology (strategy/pipeline/moe/AC fields removed);
  main's AC-type change there is moot.
- infrastructure.py: keep moe_parallel_config param + cast_frozen_modules import;
  drop the relocated moe.config MoEParallelizerConfig import; widen activation_checkpointing.
- ddp.py / diffusion: preserve find_unused_parameters via DDPConfig, drop backend.
- multimodal/finetune.py: fix moe_config= -> moe_parallel_config= to match the API.
- tests: align dist_utils + diffusion DDP tests with the new DistributedSetup API.
Pull in Wan2.2 two-stage finetuning (#2284). The only conflict was the
diffusion FSDP2 manager_args build: keep the PR's
_build_diffusion_parallel_manager_args helper and teach it to honor
fsdp.cpu_offload -> CPUOffloadPolicy so #2284's CPU-offload support is
preserved through the refactored path.
@akoumpa

akoumpa commented Jun 9, 2026

Contributor

/ok to test 7a55fc1

The DDP strategy config exposes find_unused_parameters (default False),
so _build_diffusion_parallel_manager_args returns it in the ddp branch.
Update the test's expected dict to match, fixing the L0 unit test failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi

Contributor

/claude review

@HuiyingLi

Contributor

/ok to test 42b703f

Comment on lines +321 to +326
if moe_parallel_config is None:
    moe_parallel_config = MoEParallelizerConfig()
parallelize_fn = partial(
    parallelize_model,
    activation_checkpointing=activation_checkpointing,
    **moe_parallel_config.to_dict(),

Contributor

Bug: the old code forwarded model_wrapper.mp_policy (from FSDP2Config) to the MoE parallelizer when MoEParallelizerConfig.mp_policy was None:

# old code
moe_kwargs = moe_config.to_dict()
if moe_kwargs.get("mp_policy") is None and model_wrapper is not None:
    moe_kwargs["mp_policy"] = getattr(model_wrapper, "mp_policy", None)

This ensured that a custom mp_policy on FSDP2Config (e.g. fp16 or custom reduce_dtype) propagated to expert sharding. The new code doesn't forward it — MoEParallelizerConfig.mp_policy defaults to None, and the MoE parallelizer falls back to its own hardcoded bf16/fp32 default.

For the default config this is identical (both default to bf16/fp32), but for users passing a custom mp_policy on FSDP2Config with EP models, the MoE sharding will silently ignore their precision choice. Consider restoring the forwarding:

Suggested change
- if moe_parallel_config is None:
-     moe_parallel_config = MoEParallelizerConfig()
- parallelize_fn = partial(
-     parallelize_model,
-     activation_checkpointing=activation_checkpointing,
-     **moe_parallel_config.to_dict(),
+ if moe_parallel_config is None:
+     moe_parallel_config = MoEParallelizerConfig()
+ moe_kwargs = moe_parallel_config.to_dict()
+ if moe_kwargs.get("mp_policy") is None and model_wrapper is not None:
+     moe_kwargs["mp_policy"] = getattr(model_wrapper, "mp_policy", None)
+ parallelize_fn = partial(
+     parallelize_model,
+     activation_checkpointing=activation_checkpointing,
+     **moe_kwargs,
+ )

Comment on lines +493 to +497
components/distributed/mesh.py
MeshContext -- strategy_config, device_mesh, moe_mesh, pipeline_config, moe_config
Properties: tp_size, pp_size, cp_size, ep_size, dp_size, dp_replicate_size
STRATEGY_MAP -- {"fsdp2": FSDP2Config, "megatron_fsdp": MegatronFSDPConfig, "ddp": DDPConfig}
MeshAxisName -- PP, DP, DP_REPLICATE, DP_SHARD, DP_SHARD_CP, DP_CP, CP, TP, EP, EP_SHARD

Contributor

Stale documentation: this block describes the pre-refactor MeshContext. After this PR:

  • MeshContext no longer has strategy_config, pipeline_config, or moe_config — those moved to DistributedSetup.
  • STRATEGY_MAP was removed from mesh.py — it's now _STRATEGY_MAP in config.py.
Suggested change
- components/distributed/mesh.py
- MeshContext -- strategy_config, device_mesh, moe_mesh, pipeline_config, moe_config
- Properties: tp_size, pp_size, cp_size, ep_size, dp_size, dp_replicate_size
- STRATEGY_MAP -- {"fsdp2": FSDP2Config, "megatron_fsdp": MegatronFSDPConfig, "ddp": DDPConfig}
- MeshAxisName -- PP, DP, DP_REPLICATE, DP_SHARD, DP_SHARD_CP, DP_CP, CP, TP, EP, EP_SHARD
+ components/distributed/mesh.py
+ MeshContext -- device_mesh, moe_mesh
+ Properties: tp_size, pp_size, cp_size, ep_size, dp_size, dp_replicate_size
+ MeshAxisName -- PP, DP, DP_REPLICATE, DP_SHARD, DP_SHARD_CP, DP_CP, CP, TP, EP, EP_SHARD

Comment on lines +558 to +560
```
components/moe/config.py
MoEParallelizerConfig -- reshard_after_forward, ignore_router_for_ac, wrap_outer_model, etc.

Contributor

Stale path: MoEParallelizerConfig was moved to components/distributed/config.py in this PR.

Suggested change
```
components/moe/config.py
MoEParallelizerConfig -- reshard_after_forward, ignore_router_for_ac, wrap_outer_model, etc.
components/distributed/config.py
MoEParallelizerConfig -- reshard_after_forward, ignore_router_for_ac, wrap_outer_model, etc.
components/moe/config.py
MoEConfig -- n_routed_experts, n_activated_experts, score_func, etc.

@HuiyingLi

Contributor

/claude review

@HuiyingLi

Contributor

/ok to test 300109d

Resolve conflicts from main (14 commits) against the distributed-config
refactor. Key resolutions:

- backend: keep the PR's removal of the configurable per-strategy `backend`
  (DDPConfig has no backend; managers don't take it; tests assert its
  absence). backend remains a process-group concern (dist_env / init).
- config.py / mesh.py / infrastructure.py: keep the PR's
  DistributedSetup/MeshContext structure and moe_parallel_config naming.
- activation checkpointing: keep the PR's design (carried on the parsed
  value and injected onto the strategy config later via
  infrastructure._with_activation_checkpointing, not in
  parse_distributed_section). Deduped a merge-duplicated
  _normalize_activation_checkpointing; updated the two selective-AC tests
  from main to assert the PR's behavior (AC stays off strategy_config).
- skills model-onboarding SKILL.md: take main's new "Declare model
  capabilities" section. moe/parallelizer.py: take main's _moe_shard_placement
  helper (it is used). _dist_utils.py: drop main's unused `import dataclasses`.
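
The activation-checkpointing resolution above (value carried next to the parsed strategy config, injected later) might look something like the sketch below. All names and fields are stand-ins chosen for illustration; in particular, that the strategy config exposes an `activation_checkpointing` field, and that the injection returns a copy, are assumptions rather than facts taken from the PR.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, Tuple, Union

ACValue = Union[bool, str]


@dataclass(frozen=True)
class StrategyConfig:
    """Stand-in strategy config; the activation_checkpointing field is assumed."""
    sequence_parallel: bool = False
    activation_checkpointing: ACValue = False


def parse_distributed_section(section: Dict[str, Any]) -> Tuple[StrategyConfig, ACValue]:
    """Parse step: the AC value is returned alongside, not written onto the strategy."""
    ac = section.get("activation_checkpointing", False)
    strategy = StrategyConfig(sequence_parallel=bool(section.get("sequence_parallel", False)))
    return strategy, ac


def with_activation_checkpointing(strategy: StrategyConfig, ac: ACValue) -> StrategyConfig:
    """Later injection step: return a copy that carries the AC setting."""
    return replace(strategy, activation_checkpointing=ac)
```

Under these assumptions, a test of the parse step alone sees the AC value stay off the strategy config, which is the behavior the two selective-AC tests were updated to assert.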

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi

Contributor

/ok to test 994ad67

- infrastructure.py: forward the model wrapper's mp_policy (from FSDP2Config)
  to the MoE expert parallelizer when MoEParallelizerConfig.mp_policy is unset,
  so a custom precision policy isn't silently dropped for EP models.
- skills/nemo-automodel-distributed-training/SKILL.md: fix stale references —
  MeshContext no longer holds strategy_config/pipeline_config/moe_config and
  STRATEGY_MAP moved to _STRATEGY_MAP in config.py; MoEParallelizerConfig now
  lives in components/distributed/config.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi

Contributor

/claude review

@HuiyingLi

Contributor

/ok to test bbd2d61

@claude claude Bot left a comment

Contributor

LGTM

Clean, well-structured refactoring that consolidates distributed setup into a single DistributedSetup object. The new layering (topology in MeshContext, policies in DistributedSetup) is clear and consistent across all recipe callsites. Test coverage is thorough — all major new code paths (DistributedSetup.build(), _resolve_distributed_setup(), _reject_separate_distributed_kwargs(), the backend removal, MegatronFSDP aliases) have dedicated tests. Skill documentation is updated to match the new API. No bugs, logic errors, or typos found.

Labels

r0.5.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.

9 participants